The last decade has seen increasing use of Item Response Theory (IRT) in the 
examination of the qualities of language tests. Although it has sometimes been 
seen exclusively as a tool for improved investigation of the reliability of tests 
(Skehan, 1989), its potential for investigation of aspects of the validity of 
language tests has also been demonstrated (McNamara, 1990). However, the 
application of IRT in this latter role has in some cases met with objections based 
on what are claimed to be the unsatisfactory theoretical assumptions of IRT, in 
particular the so-called 'unidimensionality' assumption (Hamp-Lyons, 1989). In 
this paper, these issues will be discussed in the context of the analysis of data 
from an ESP Listening test for health professionals, part of a larger test, the 
Occupational English Test (OET), recently developed on behalf of the 
Australian Government (McNamara, 1989b). 

The paper is in three sections. First, there is a brief description of the 
Listening sub-test of the OET. Second, the appropriateness of the use of IRT in 
language testing research is discussed. Third, the use of IRT in the validation of 
the Listening sub-test of the OET is reported. In this part of the paper, the issue 
of unidimensionality is considered in the context of analysis of data from the two 
parts of this test. 



THE LISTENING SUB-TEST OF THE OCCUPATIONAL ENGLISH TEST 



The Occupational English Test (McNamara, 1989b) is administered to several 
hundred immigrant and refugee health professionals wishing to take up practice 
in Australia each year. The majority of these are medical practitioners, but the 
following professional groups are also represented: nurses, physiotherapists, 
occupational therapists, dentists, speech pathologists and veterinary surgeons, 
among others. Responsibility for administering the test lies with the National 
Office for Overseas Skills Recognition (NOOSR), part of the Commonwealth 
Government's Department of Employment, Education and Training. NOOSR 
was established in 1989 as an expanded version of what had been until then the 
Council for Overseas Professional Qualifications (COPQ). 

The OET is taken as one of three stages of the process of registration for 
practice in Australia (the other stages involve pencil-and-paper and practical 
assessments of relevant clinical knowledge and skills). Prior to 1987, the OET 
was a test of general English proficiency and was attracting increasing criticism 
from test takers and test users in terms of its validity and reliability. In response 
to this, COPQ initiated a series of consultancies on reform of the test. The 
report on the first of these, which was carried out by a team at Lancaster 
University, recommended the creation of a test which would (Alderson et al., 
1986: 3) 



assess the ability of candidates to communicate effectively in the workplace. 

A series of further consultancies (McNamara, 1987; McNamara, 1988a; 
McNamara, 1989a) established the form of the new test and developed and 
trialled materials for it. There are four sub-tests, one each for Speaking, 
Listening, Reading and Writing. The format of the new test is described in 
McNamara (1989b). The validation of the Speaking and Writing sub-tests is 
discussed in McNamara (1990). 

The Listening sub-test is a 50-minute test in two parts. Part A involves 
listening to a talk on a professionally relevant subject. There are approximately 
twelve short answer questions, some with several parts; the maximum score on 
this part of the test is usually about twenty-five. Part B involves listening to a 
consultation between a general practitioner and a patient. There are 
approximately twenty short answer questions (again, some have several parts); 
the maximum score here is usually twenty-five, giving a total maximum score of 
approximately fifty on thirty-two items. Because of test security considerations, 
new materials are developed for each session of the test, which is held twice a 
year. 

Before going on to report on the use of IRT in the validation of the 
Listening sub-test of the OET, the debate about the appropriateness of the use 
of IRT in language testing research will be reviewed. 



Applications of IRT in language testing 

The application of IRT to the area of language testing is relatively recent. 
Oller (1983) contains no reference to IRT in a wide-ranging collection. By contrast, 
IRT has featured in a number of studies since the early 1980s. Much of this work 
has focused on the advantages of IRT over classical theory in investigating the 
reliability of tests (e.g. Henning, 1984). More significant is the use of IRT to examine 
aspects of the validity, in particular the construct validity, of tests. 

de Jong and Glas (1987) examined the construct validity of tests of foreign 
language listening comprehension by comparing the performance of native and 
non-native speakers on the tests. It was hypothesized in this work that native 
speakers would have a greater chance of scoring right answers on items: this was 
largely borne out by the data. Moreover, items identified in the analysis as 
showing 'misfit' should not show these same properties in relation to native 
speaker performance as items not showing misfit (that is, on 'misfitting' items 
native speaker performance will show greater overlap with the performance of 
non-native speakers); this was also confirmed. The researchers conclude (de 
Jong and Glas, 1987: 191): 

The ability to evaluate a given fragment of discourse in order to understand 
what someone is meaning to say cannot be measured along the same 
dimension as the ability to understand aurally perceived text at the literal level. 
Items requiring literal understanding discriminate better between native 
speakers and non-native learners of a language and are therefore better 
measures of foreign language listening comprehension. 

This finding is provocative, as it seems to go against current views on the 
role of inferencing processes and reader/listener schemata in comprehension (cf. 
Carrell, Devine and Eskey, 1988; Widdowson, 1983; Nunan, 1987a). One might 
argue that the IRT analysis has simply confirmed the erroneous assumption that 
the essential construct requiring measurement is whatever distinguishes the 
listening abilities of native- and non-native speakers. An alternative viewpoint is 
that there will in fact be considerable overlap between the abilities of native- and 
non-native speakers in higher-level cognitive tasks involved in discourse 
comprehension. If the analysis of listening test data reveals that all test items fail 
to lie on a single dimension of listening ability, then this is in itself a valid finding 
about the multi-dimensional nature of listening comprehension in a foreign 
language and should not be discounted. The point is that interpretation of the 
results of IRT analysis must be informed by an in principle understanding of the 
relevant constructs. 

In the area of speaking, the use of IRT analysis in the development of the 
Interview Test of English as a Second Language (ITESL) is reported in Adams, 
Griffin and Martin, 1987; Griffin, Adams, Martin and Tomlinson, 1988. These 
authors argue that their research confirms the existence of a hypothesized 
'developmental dimension of grammatical competence... in English S[econd] 
L[anguage] A[cquisition]' (1988: 12). This finding has provoked considerable 
controversy. Spolsky (1988: 123), in a generally highly favourable review of the 
test, urges some caution in relation to the claims for its construct validity: 

The authors use their results to argue for the existence of a grammatical 
proficiency dimension, but some of the items are somewhat more general. 
The nouns, verbs and adjectives items for instance are more usually classified 
as vocabulary. One would have liked to see different kinds of items added 
until the procedure showed that the limit of the unidimensionality criterion 
had now been reached. 



Nunan (1988: 56) is quite critical of the test's construct validity, particularly 
in the light of current research in second language acquisition: 

The major problem that I have with the test... [is] that it fails adequately to 
reflect the realities and complexities of language development. 

Elsewhere, Nunan (1987b: 156) is more trenchant: 

[The test] illustrates quite nicely the dangers of attempting to generate models 
of second language acquisition by running theoretically unmotivated data 
from poorly conceptualized tests through a powerful statistical programme. 

Griffin has responded to these criticisms (cf Griffin, 1988 and the discussion 
in Nunan, 1988). However, more recently, Hamp-Lyons (1989) has added her 
voice to the criticism of the ITESL. She summarizes her response to the study 
by Adams, Griffin and Martin (1987) as follows (1989: 117): 

...This study... is a backward step for both language testing and language 
teaching. 

She takes the writers to task for failing to characterize properly the 
dimension of 'grammatical competence' which the study claims to have 
validated; like Spolsky and Nunan, she finds the inclusion of some content areas 
puzzling in such a test. She argues against the logic of the design of the research 
project (1989: 115): 

Their assumption that if the data fit the psychometric model they de facto 
validate the model of separable grammatical competence is questionable. If 
you construct a test to test a single dimension and then find that it does indeed 
test a single dimension, how can you conclude that this dimension exists 
independently of other language variables? The unidimensionality, if that is 
really what it is, is an artifact of the test development. 

On the question of the unidimensionality assumption, Hamp-Lyons (1989: 
114) warns the developers of the ITESL test that they have a responsibility to 
acknowledge 

...the limitations of the partial credit model, especially the question of the 
unidimensionality assumption of the partial credit model, the conditions 
under which that assumption can be said to be violated, and the significance 
of this for the psycholinguistic questions they are investigating... They need to 
note that the model is very robust to violations of unidimensionality. 

She further (1989: 116) criticizes the developers of the ITESL for their 
failure to consider the implications of the results of their test development 
project for the classroom and the curriculum from which it grew. 

Hamp-Lyons's anxieties about the homogeneity of items included in the 
test, echoed by Nunan and Spolsky, seem well-founded. But this is perhaps 
simply a question of revision of the test content. More substantially, her point 
about the responsibilities of test developers to consider the backwash effects of 
their test instruments is well taken, although some practical uses of the test seem 
unexceptionable (for example, as part of a placement procedure; cf the 
discussion reported in McNamara, 1988b: 57-61). Its diagnostic function is 
perhaps more limited, though again this could probably be improved by revision 
of the test content (although for a counter view on the feasibility of diagnostic 
tests of grammar, see Hughes, 1989: 13-14). 

However, when Adams, Griffin and Martin (1987: 25) refer to using 
information derived from the test 

in monitoring and developing profiles, 

they may be claiming a greater role for the test in the curriculum. If so, this 
requires justification on a quite different basis, as Hamp-Lyons is right to point 
out. Again, a priori arguments about the proper relationship between testing and 
teaching must accompany discussion of research findings based on IRT analysis. 

A more important issue for this paper is Hamp-Lyons's argument about the 
unidimensionality assumption. Here it seems that she may have misinterpreted 
the claims of the model, which hypothesizes (but does not assume in the sense of 
'take for granted' or 'require') a single dimension of ability and difficulty. Its 
analysis of test data represents a test of this hypothesis in relation to the data. 
The function of the fit t-statistics, a feature of IRT analysis, is to indicate the 
probability of a particular pattern of responses (to an item or on the part of an 
individual) in the case that this hypothesis is true. Extreme values of t, 
particularly extreme positive values of t, are an indication that the hypothesis is 
unlikely to be true for the item or the individual concerned. If items or 
individuals are found in this way to be disconfirming the hypothesis, this may be 
interpreted in a number of ways. In relation to items, it may indicate (1) that 
the item is poorly constructed; (2) that if the item is well-constructed, it does 
not form part of the same dimension as defined by other items in the test, and is 
therefore measuring a different construct or trait. In relation to persons, it may 
indicate (1) that the performance on a particular item was not indicative of the 
candidate's ability in general, and may have been the result of irrelevant factors 
such as fatigue, inattention, failure to take the test item seriously, factors which 
Henning (1987: 96) groups under the heading of response validity; (2) that the 
ability of the candidate involved cannot be measured appropriately by the test 
instrument, that the pattern of responses cannot be explained in the same terms 
as applied to other candidates, that is, there is a heterogeneous test population in 
terms of the hypothesis under consideration; (3) that there may be surprising 
gaps in the candidate's knowledge of the areas covered by the test; this 
information can then be used for diagnostic and remedial purposes. 
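
The logic of item fit can be illustrated with a minimal sketch. It assumes the simple dichotomous Rasch model rather than the Partial Credit Model, and it computes an unweighted mean-square fit value of the kind that fit t-statistics are usually described as standardizing; the function names and the data are hypothetical and are not drawn from any of the studies discussed.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_outfit_mean_square(responses, abilities, difficulty):
    """Unweighted mean of squared standardized residuals for one item.

    responses:  0/1 scores of each person on the item
    abilities:  logit ability estimates for the same persons
    difficulty: logit difficulty estimate of the item
    """
    squared_residuals = []
    for x, b in zip(responses, abilities):
        p = rasch_probability(b, difficulty)
        variance = p * (1.0 - p)  # model variance of a 0/1 response
        squared_residuals.append((x - p) ** 2 / variance)
    return sum(squared_residuals) / len(squared_residuals)

# Hypothetical data: five persons, one item of difficulty 0 logits.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [0, 0, 1, 1, 1]  # abler persons succeed
reversed_pattern = [1, 1, 0, 0, 0]  # abler persons fail: surprising under the model

print(item_outfit_mean_square(expected_pattern, abilities, 0.0))
print(item_outfit_mean_square(reversed_pattern, abilities, 0.0))
```

Under these assumptions the response pattern that agrees with the ability ordering yields a value below 1, while the reversed pattern yields a value well above 1; it is large values of this kind that count against the hypothesis of a single dimension for the item concerned.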

A further point to note is that the dimension so defined is a measurement 
dimension which is constructed by the analysis, which must be distinguished 
from the dimensions of underlying knowledge or ability which may be 
hypothesized on other, theoretical grounds. IRT analyses do not 'discover' or 
'reveal' existing underlying dimensions, but rather construct dimensions for the 
purposes of measurement on the basis of test performance. The relationship 
between these two conceptions of dimensionality will be discussed further below. 

Hamp-Lyons is in effect arguing, then, that IRT analysis is insufficiently 
sensitive in its ability to detect in the data departures from its hypothesis about 
an underlying ability-difficulty continuum. The evidence for this claim, she 
argues, is in a paper by Henning, Hudson and Turner (1985), in which the 
appropriateness of Rasch analysis with its attempt to construct a single 
dimension is questioned in the light of the fact that in language test data 
(Henning, Hudson and Turner, 1985: 142) 

...examinee performance is confounded with many cognitive and affective 
test factors such as test wiseness, cognitive style, test-taking strategy, 
fatigue, motivation and anxiety. Thus, no test can strictly be said to 
measure one and only one trait. 

(In passing, it should be noted that these are not the usual grounds for objection 
to the supposedly unidimensional nature of performance on language tests, as 
these factors have been usefully grouped together elsewhere by Henning under 
the heading of response validity (cf above). The more usual argument is that the 
linguistic and cognitive skills underlying performance on language tests cannot be 
conceptualized as being of one type.) Henning et al. examined performance of 
some three hundred candidates on the UCLA English as a Second Language 
Placement Examination. There were 150 multiple choice items, thirty in each of 
five sub-tests: Listening Comprehension, Reading Comprehension, Grammar 
Accuracy, Vocabulary Recognition and Writing Error Detection. Relatively few 
details of each sub-test are provided, although we might conclude that the first 
two sub-tests focus on language use and the other three on language usage. 
This assumes that inferencing is required to answer questions in the first two 
sub-tests; it is of course quite possible that the questions mostly involve 
processing of literal meaning only, and in that sense are rather more like the 
other sub-tests (cf the discussion of this point in relation to de Jong and Glas 
(1987) above). The data were analysed using the Rasch one-parameter model, 
and although this is not reported in detail, it is clear from Table two on p. 153 
that eleven misfitting items were found, with the distribution over the sub-tests 
as follows: Listening, 4; Reading, 4; Grammar, 1; Vocabulary, 3; Writing error 
detection, 3. (Interestingly, the highest numbers of misfitting items were in the 
Listening and Reading sub-tests.) One might reasonably conclude that the 
majority of test items may be used to construct a single continuum of ability and 
difficulty. We must say 'the majority' because in fact the Rasch analysis does 
identify a number of items as not contributing to the definition of a single 
underlying continuum; unfortunately, no analysis is offered of these items, so we 
are unable to conclude whether they fall into the category of poorly written items 
or into the category of sound items which define some different kind of ability. 
It is not clear what this continuum should be called; as stated above, 
investigation of what is required to answer the items, particularly in the Reading 
and Listening comprehension sub-tests, is needed. In order to gain independent 
evidence for the Rasch finding of the existence of a single dimension underlying 
performance on the majority of items in the test, Henning et al. report two 
other findings. First, factor analytic studies on previous versions of the test 
showed that the test as a whole demonstrated a single factor solution. Secondly, 
the application of a technique known as the Bejar technique for exploring the 
dimensionality of the test battery appeared to confirm the Rasch analysis 
findings. Subsequently, Henning et al.'s use of the Bejar technique has 
convincingly been shown to have been unrevealing (Spurling, 1987a; Spurling, 
1987b). Henning et al. nevertheless conclude that the fact that a single 
dimension of ability and difficulty was defined by the Rasch analysis of their data 
despite the apparent diversity of the language subskills included in the tests 
shows that Rasch analysis is (Henning, Hudson and Turner, 1985: 152) 

sufficiently robust with regard to the assumption of unidimensionality to 
permit applications to the development and analysis of language tests. 

(Note again in passing that the analysis by this point in the study is examining a 
rather different aspect of the possible inappropriateness or otherwise of IRT in 
relation to language test data than that proposed earlier in the study, although 
now closer to the usual grounds for dispute.) The problem here, as Hamp-Lyons 
is right to point out, is that what Henning et al. call 'robustness' and take to be 
a virtue leads to conclusions which, looked at from another point of view, seem 
worrying. That is, the unidimensional construct defined by the test analysis 
seems in some sense to be at odds with the a priori construct validity, or at least 
the face validity, of the test being analysed, and at the very least needs further 
discussion. However, as has been shown above, the results of the IRT analysis in 
the Henning study are ambiguous, the nature of the tests being analysed is not 
clear, and the definition of a single construct is plausible on one reading of the 
sub-tests' content. Clearly, as the results of the de Jong and Glas study show 
(and whether or not we agree with their interpretation of those results), IRT 
analysis is capable of defining different dimensions of ability within a test of a 
single language sub-skill, and is not necessarily 'robust' in that sense at all, that 
is, the sense that troubles Hamp-Lyons. 

In a follow-up study, Henning (1988: 95) found that fit statistics for both 
items and persons were sensitive to whether they were calculated in 
unidimensional or multidimensional contexts, that is, they were sensitive to 
'violations of unidimensionality'. (In this study, multidimensionality in the data 
was confirmed by factor analysis.) However, it is not clear why fit statistics 
should have been used in this study; the measurement model's primary claims 
are about the estimates of person ability and item difficulty, and it is these 
estimates which should form the basis of argumentation (cf the advice on this 
point in relation to item estimates in Wright and Masters, 1982: 114-117). 

In fact, the discussions of Hamp-Lyons and Henning are each marked by a 
failure to distinguish two types of model: a measurement model and a model of 
the various skills and abilities potentially underlying test performance. These are 
not at all the same thing. The measurement model posited and tested by IRT 
analysis deals with the question, 'Does it make sense in measurement terms to 
sum scores on different parts of the test? Can all items be summed 
meaningfully? Are all candidates being measured in the same terms?' This is the 
'unidimensionality' assumption; the alternative position requires us to say that 
separate, qualitative statements about performance on each test item, and of 
each candidate, are the only valid basis for reporting test performance. All tests 
which involve the summing of scores across different items or different test parts 
make the same assumption. It should be pointed out, for example, that classical 
item analysis makes the same 'assumption' of unidimensionality, but lacks tests 
of this 'assumption' to signal violations of it. As for the interpretation of test 
scores, this must be done in the light of our best understanding of the nature 
of language abilities, that is, in the light of current models of the constructs 




models such as IRT, and both kinds of analysis have the potential to illuminate 
the nature of what is being measured in a particular language test. 

It seems, then, that Hamp-Lyons's criticisms of IRT on the score of 
unidimensionality are unwarranted, although, as stated above, results always 
need to be interpreted in the light of an independent theoretical perspective. In 
fact, independent evidence (for example via factor analysis) may be sought for the 
conclusions of an IRT analysis when there are grounds for doubting them, for 
example when they appear to overturn long- or dearly-held beliefs about the 
nature of aspects of language proficiency. Also, without wishing to enter into 
what Hamp-Lyons (1989: 114) calls 

the hoary issue of whether language competence is unitary or divisible, 

it is clear that there is likely to be a degree of commonality or shared variance on 
tests of language proficiency of various types, particularly at advanced levels (cf 
the discussions in Henning (1989: 98) and de Jong and Henning (1990) of recent 
evidence in relation to this point). 

Hamp-Lyons (1989) contrasts Griffin et al.'s work on the ITESL with a 
study on writing development by Pollitt and Hutchinson (1987), whose approach 
she views in a wholly positive light. Analysis of data from performance by 
children in the middle years of secondary school on a series of writing tasks in 
English, their mother tongue in most cases, led to the following finding (Pollitt 
and Hutchinson, 1987: 88): 

Different writing tasks make different demands, calling on different language 
functions and setting criteria for competence that are more or less easy to 
meet. 

Pollitt (in press, quoted in Skehan, 1989: 4) 

discusses how the scale of difficulty identified by IRT can be related to 
underlying cognitive stages in the development of a skill. 

For Hamp-Lyons (1989: 113), Pollitt and Hutchinson's work is also 
significant as an example of a valuable fusion of practical test development and 
theory building. 

Several other studies exist which use the IRT Rating Scale model (Andrich, 
1978a; Andrich, 1978b; cf Wright and Masters, 1982) to investigate assessments 
of writing (Henning and Davidson, 1987; McNamara, 1990), speaking 
(McNamara, 1990) and student self assessment of a range of language skills 
(Davidson and Henning, 1985). These will not be considered in detail here, but 
demonstrate further the potential of IRT to investigate the validity of language 
assessments. 

THE OET LISTENING SUB-TEST: DATA 

Data from 196 candidates who took the Listening sub-test in August, 1987 
were available for analysis using the Partial Credit Model (Wright and Masters, 
1982) with the help of facilities provided by the Australian Council for Educational 
Research. The material used in the test had been trialled and subsequently 
revised prior to its use in the full session of the OET. Part A of the test 
consisted of short answer questions on a talk about communication between 
different groups of health professionals in hospital settings. Part B of the test 
involved a guided history taking in note form based on a recording of a 
consultation between a doctor and a patient suffering headaches subsequent to a 
serious car accident two years previously. Full details of the materials and the 
trialling of the test can be found in McNamara (in preparation). 

The analysis was used to answer the following questions: 

1  Is it possible to construct a single measurement dimension of 'listening 
   ability' from the data from the test as a whole? Does it make sense to add 
   the scores from the two parts of the Listening sub-test? That is, is the 
   Listening test 'unidimensional'? 

2  If the answer to the first question is in the affirmative, can we distinguish 
   the skills involved in the two parts of the sub-test, or are essentially the 
   same skills involved in both? That is, what does the test tell us about the 
   nature of the listening skills being tapped in the two parts of the sub-test? 
   And from a practical point of view, if both sub-tests involve the same 
   skills, could one part of the sub-test be eliminated in the interests of 
   efficiency? 

Two sorts of evidence were available in relation to the first question. 
First, Part A and Part B were each treated as separate tests, and 

should be identical; that is, estimates of person ability should be independent of 
the part of the test on which that estimate is based. The analysis was carried out 
using the programme MSTEPS (Wright, Congdon and Rossner, 1987). 

Using the data from both parts as a single data set, two candidates who got 
perfect scores were excluded from the analysis, leaving data from 194 candidates. 
There were a maximum of forty-nine score points from the thirty-two items. 
Using data from Part A only, scores from five candidates who got perfect scores 
or scores of zero were excluded, leaving data from 191 candidates. There were 
a maximum of twenty-four score points from twelve items. Using data from Part 
B only, scores of nineteen candidates with perfect scores were excluded, leaving 
data from 177 candidates. There were a maximum of twenty-five score points 
from twenty items. Table 1 gives summary statistics from each analysis. The 
test reliability of person separation (the proportion of the observed variance in 
logit measurements of ability which is not due to measurement error; Wright 
and Masters, 1982: 105-106), termed the 'Rasch analogue of the familiar KR20 
index' by Pollitt and Hutchinson (1987: 82), is higher for the test as a whole than 
for either of the two parts treated independently. The figure for the test as a 
whole is satisfactory (.85). 
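
The definition of person separation reliability given in parentheses above can be made concrete with a short sketch. It assumes that the observed variance is the sample variance of the logit ability estimates and that error variance is the mean of the squared standard errors; the function name and the figures are hypothetical and are not the OET data.

```python
def person_separation_reliability(abilities, standard_errors):
    """Proportion of observed variance in ability estimates not due to error."""
    n = len(abilities)
    mean = sum(abilities) / n
    # Observed variance of the logit ability estimates (n - 1 is an assumption).
    observed_variance = sum((b - mean) ** 2 for b in abilities) / (n - 1)
    # Mean square measurement error across persons.
    mean_square_error = sum(se ** 2 for se in standard_errors) / n
    return (observed_variance - mean_square_error) / observed_variance

# Hypothetical logit estimates and standard errors for five persons.
abilities = [-1.0, 0.0, 0.5, 1.5, 2.5]
standard_errors = [0.5, 0.4, 0.4, 0.5, 0.6]

print(round(person_separation_reliability(abilities, standard_errors), 2))
```

On this formulation a shorter test, which yields larger standard errors for each person, will tend to show lower reliability than a longer one measuring the same candidates.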



Table 1 Summary statistics, Listening sub-test 

                            Parts A and B    Part A    Part B 

N                                194          191       177 
Number of items                   32           12        20 
Maximum raw score                 49           24        25 
Mean raw score                    34.2         14.4      19.4 
SD (raw scores)                    9.5          5.3       4.5 
Mean logit score                   1.46         0.86      1.67 
SD (logits)                        U3           1.44      1.25 
Mean error (logits)                 .48          .71       .75 
Person separation 
reliability (like KR-20)            .85          .74       .60 


Table 2 gives information on misfitting persons and items in each analysis. 

Table 2 Numbers of misfitting items and persons, Listening sub-test 

             Parts A and B    Part A         Part B 

Items        2 (#7, #12)      2 (#7, #12)    1 (#25) 
Persons      2                1              5 



The analysis reveals that the number of misfitting items is low. The same is 
true for misfitting persons, particularly for the test as a whole and Part A 
considered independently. Pollitt and Hutchinson (1987: 82) point out that we 
would normally expect around 2% of candidates to generate fit values above +2. 

On this analysis, then, it seems that when the test data are treated as a single 
test, the item and person fit statistics indicate that all the items except two 
combine to define a single measurement dimension; and the overwhelming 
majority of candidates can be measured meaningfully in terms of the dimension 
of ability so constructed. Our first question has been answered in the 
affirmative. 

It follows that if the Listening sub-test as a whole satisfies the 
unidimensionality assumption, then person ability estimates derived from each of 
the two parts of the sub-test treated separately should be independent of the 
Part of the test on which they are made. Two statistical tests were used for this 
purpose. 

The first test was used to investigate the research hypothesis of a perfect 
correlation between the ability estimates arrived at separately by treating the 
data from Part A of the test independently of the data from Part B of the test. 
The correlation between the two sets of ability estimates was calculated, 
corrected for attenuation by taking into account the observed reliability of the 
two parts of the test (Part A: .74, Part B: .60; cf Table 1 above). (The 
procedure used and its justification are explained in Henning, 1987: 85-86.) Let 
the ability estimate of Person n on Part A of the test be denoted by $b_{nA}$ and the 
ability estimate of Person n on Part B of the test be denoted by $b_{nB}$. The 
correlation between these two ability estimates, uncorrected for attenuation, was 
found to be .74. In order to correct for attenuation, we use the formula 

$$ r_{xy} = \frac{R_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$ 

where $r_{xy}$ = the correlation corrected for attenuation 
      $R_{xy}$ = the observed correlation, uncorrected 
      $r_{xx}$ = the reliability coefficient for the measure of the variable x 
      $r_{yy}$ = the reliability coefficient for the measure of the variable y 

and where if $r_{xy}$ > 1, report $r_{xy}$ = 1. 

The correlation thus corrected for attenuation was found to be > 1, and 
hence may be reported as 1. This test, then, enables us to reject the hypothesis 
that there is not a perfect linear relationship between the ability estimates from 
each part of the test, and thus offers support for the research hypothesis that the 
true correlation is 1. 
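
The arithmetic of the correction can be checked with a short sketch using the three figures reported above (observed correlation .74, reliabilities .74 and .60). The function name is hypothetical, and the capping at 1 follows the reporting rule stated with the formula.

```python
import math

def correct_for_attenuation(observed_r, reliability_x, reliability_y):
    """Correlation corrected for attenuation, reported as 1 if it exceeds 1."""
    corrected = observed_r / math.sqrt(reliability_x * reliability_y)
    return min(corrected, 1.0)

# Figures reported in the text: observed r = .74, reliabilities .74 and .60.
uncapped = 0.74 / math.sqrt(0.74 * 0.60)
print(round(uncapped, 2))                         # about 1.11, i.e. greater than 1
print(correct_for_attenuation(0.74, 0.74, 0.60))  # reported as 1.0
```

A corrected value above 1 arises here because the observed correlation is as high as the reliability of the more reliable part, so that the product of the two reliabilities under the square root is smaller than the square of the observed correlation.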

The correlation test is only a test of the linearity of the relationship between 
the estimates. As a more rigorous test of the equality of the ability estimates, a 
$\chi^2$ test was done. Let the 'true' ability of person n be denoted by $B_n$. Then $b_{nA}$ 
and $b_{nB}$ are estimates of $B_n$. It follows from maximum likelihood estimation 
theory (Cramer, 1946) that, because $b_{nA}$ and $b_{nB}$ are maximum likelihood 
estimators of $B_n$ (in the case when both sets of estimates are centred about a 
mean of zero), 

$$ b_{nA} \sim N(B_n,\, e_{nA}^2) $$

where enA is the error of the estimate of the ability of Person n 
on Part A of the test, and 

$$ b_{nB} \sim N(B_n,\, e_{nB}^2) $$

where enB is the error of the estimate of the ability of Person n on Part B of the 
test. 

From Table 1, the mean logit score on Part B of the test is 1.67, while the 
mean logit score on Part A of the test is .86. As the mean ability estimates for 
the scores on each part of the test have thus not been set at zero (due to the fact 
that items, not people, have been centred), allowance must be made for the 
relative difficulty of each part of the test (Part B was considerably less difficult 
than Part A). On average, then, bnB - bnA = .81. It follows that if the 
hypothesis that the estimates of ability from the two parts of the test are identical 
is true, then bnB - bnA - .81 = 0. It also follows from above that 

$$ b_{nB} - b_{nA} - .81 \sim N(0,\, e_{nB}^2 + e_{nA}^2) $$

and thus that 

$$ \frac{b_{nB} - b_{nA} - .81}{\sqrt{e_{nB}^2 + e_{nA}^2}} \sim N(0,\, 1) $$

if the differences between the ability estimates (corrected for the relative 
difficulty of the two parts of the test) are converted to z-scores, as in the above 
formula. If the hypothesis under consideration is true, then the resulting set of 
z-scores will have a unit normal distribution; a normal probability plot of these 
z-scores can be done to confirm the assumption of normality. These z-scores for 
each candidate are then squared to get a value of χ² for each candidate. In 
order to evaluate the hypothesis under consideration for the entire set of scores, 
the test statistic is 



$$ \chi^2_{N-1} = \sum_{i=1}^{N} z_i^2 $$

where N = 174 and z_i is the z-score for candidate i.

The resulting value of χ² is 155.48, df = 173, p = .84. (The normal 
probability plot confirmed that the z-scores were distributed normally). The 
second statistical test thus enables us to reject the hypothesis that the ability 
estimates on the two parts of the test are not identical, and thus offers support 
for the research hypothesis of equality. 
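
The steps of this second test can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually run for the study: the function names are hypothetical, the ability estimates and standard errors in the example are invented placeholders rather than data from the test, and the p-value uses the Wilson-Hilferty normal approximation to the chi-square distribution (chosen here only to avoid third-party dependencies), so it may differ slightly from a figure obtained with exact tables or software.

```python
import math


def chi2_upper_tail(x, df):
    """Approximate P(chi-square with df degrees of freedom >= x).

    Assumption of this sketch: Wilson-Hilferty normal approximation.
    """
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    # Upper-tail probability of a standard normal variate at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def equality_test(b_a, b_b, se_a, se_b):
    """Test equality of two sets of ability estimates for the same persons.

    b_a, b_b   : ability estimates (logits) from Part A and Part B
    se_a, se_b : standard errors of those estimates
    Returns (chi2, df, p).
    """
    n = len(b_a)
    # Allowance for the relative difficulty of the two parts
    # (the mean difference, .81 in the text).
    shift = sum(b_b) / n - sum(b_a) / n
    z_scores = [
        (bb - ba - shift) / math.sqrt(sa ** 2 + sb ** 2)
        for ba, bb, sa, sb in zip(b_a, b_b, se_a, se_b)
    ]
    chi2 = sum(z * z for z in z_scores)
    df = n - 1
    return chi2, df, chi2_upper_tail(chi2, df)


# Hypothetical placeholder data for five candidates (not from the study).
b_a = [-0.5, 0.2, 0.9, 1.6, 2.1]
b_b = [0.4, 0.9, 1.8, 2.3, 3.0]
se_a = [0.45, 0.40, 0.42, 0.50, 0.60]
se_b = [0.50, 0.48, 0.52, 0.58, 0.70]
chi2, df, p = equality_test(b_a, b_b, se_a, se_b)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.2f}")
```

A large p-value here means the shifted differences are no larger than the standard errors would lead one to expect, which is the pattern the text reports for the two parts of the Listening sub-test.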

The two statistical tests thus provide strong evidence for the assumption of 
unidimensionality in relation to the test as a whole, and confirm the findings of 
the analysis of the data from the whole test taken as a single data set. In contrast 
to the previously mentioned study of Henning (1988), which relied on an analysis 
of fit statistics, the tests chosen are appropriate, as they depend on ability 
estimates directly. 




Now that the unidimensionality of the test has been confirmed, 
performance on items on each part of the test may be considered. Figure 1 is a 
map of the difficulty of items using the data from performance on the test as a 
whole (N = 194). 



Figure 1 Item difficulty map 

Figure 1 reveals that the two parts of the test occupy different areas of the 
map, with some overlap. For example, of the eight most difficult items, seven 
are from Part A of the test (Part A contains twelve items); conversely, of the 
eight easiest items, seven are from Part B of the test (Part B has twenty items). 
It is clear then that differing areas of ability are tapped by the two parts of the 
test. This is most probably a question of the content of each part; Part A 
involves following an abstract discourse, whereas Part B involves understanding 
details of concrete events and personal circumstances in the case history. The 
two types of listening task can be viewed perhaps in terms of the continua 'more 
or less cognitively demanding' and 'more or less context embedded' proposed by 
Cummins (1984). The data from the test may be seen as offering support for a 
similar distinction in the context of listening tasks facing health professionals 
working through the medium of a second language. The data also offer evidence 
in support of the content validity of the test, and suggest that the two parts are 
sufficiently distinct to warrant keeping both. Certainly, in terms of backwash 
effect, one would not want to remove the part of the test which focuses on the 
consultation, as face-to-face communication with patients is perceived by former 
test candidates as the most frequent and the most complex of the communication 
tasks facing them in clinical settings (McNamara, 1989b). 

The interpretation offered above is similar in kind to that offered by Pollitt 
and Hutchinson (1987) of task separation in a test of writing, and further 
illustrates the potential of IRT for the investigation of issues of validity as well as 
reliability in language tests (McNamara, 1990). 



CONCLUSION 

An IRT Partial Credit analysis of a two-part ESP listening test for health 
professionals has been used in this study to investigate the controversial issue of 
test unidimensionality, as well as the nature of listening tasks in the test. The 
analysis involves the use of two independent tests of unidimensionality, and both 
confirm the finding of the usual analysis of the test data in this case, that is, that 
it is possible to construct a single dimension using the items on the test for the 
measurement of listening ability in health professional contexts. This 
independent confirmation, together with the discussion of the real nature of the 
issues involved, suggests that the misgivings sometimes voiced about the 
limitations or indeed the inappropriateness of IRT for the analysis of language 
test data may not be justified. This is not to suggest, of course, that we should be 
uncritical of applications of the techniques of IRT analysis. 

Moreover, the analysis has shown that the kinds of listening tasks presented to 
candidates in the two parts of the test represent significantly different tasks in terms 
of the level of ability required to deal successfully with them. This further confirms 
the useful role of IRT in the investigation of the content and construct validity of 
language tests. 